Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization
Abstract
In Automatic Text Summarization, preprocessing is an important phase that reduces the space of textual representation. Classically, stemming and lemmatization have been widely used to normalize words. However, even with normalization, the curse of dimensionality can degrade the performance of summarizers on large texts. This paper describes a new method of word normalization that further reduces the space of representation. We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of the summaries produced with this representation, but often dramatically improves the performance of the systems. Summaries on trilingual corpora were evaluated automatically with Fresa. Results confirm an increase in performance, regardless of the summarizer system used.
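The idea of reducing each word to its initial letters can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `ultra_stem`, `tokenize`, and `ultra_stem_counts`, the regex tokenizer, and the prefix length `n` are assumptions made for illustration only.

```python
import re
from collections import Counter


def ultra_stem(word: str, n: int = 1) -> str:
    """Truncate a word to its first n letters (n is an illustrative parameter)."""
    return word.lower()[:n]


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def ultra_stem_counts(text: str, n: int = 1) -> Counter:
    """Count occurrences of each ultra-stem, i.e. the reduced term space."""
    return Counter(ultra_stem(token, n) for token in tokenize(text))


if __name__ == "__main__":
    sample = "Summaries summarize summarized texts"
    # With n=4, the three inflected forms collapse onto one term.
    print(ultra_stem_counts(sample, n=4))
```

In this sketch, three distinct surface forms collapse onto a single term when `n=4`, which is the kind of reduction of the representation space the abstract describes. Smaller values of `n` shrink the space further, at the cost of merging unrelated words.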
Similar resources
Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization
Preprocessing is a preliminary step in many fields, including IR and NLP. The effect of basic preprocessing settings on English text summarization is well studied. However, to the best of our knowledge, no such effort exists for the Urdu language. In this study, we analyze the effect of basic preprocessing settings for single-document text summarization for Urdu, on a benchmark co...
A Feature Terms based Method for Improving Text Summarization with Supervised POS Tagging
Text summarization is the process of distilling the most important information from a source to produce an abridged version for a particular user and task. When this is done by means of a computer, i.e. automatically, it is called Automatic Text Summarization. Summarization can be classified into two approaches: extraction and abstraction. Extraction-based summaries are produced by concatenating...
Automatic Semantic Subject Indexing of Web Documents in Highly Inflected Languages
Structured semantic metadata about unstructured web documents can be created using automatic subject indexing methods, avoiding laborious manual indexing. A successful automatic subject indexing tool for the web should work with texts in multiple languages and be independent of the domain of discourse of the documents and controlled vocabularies. However, analyzing text written in a highly inflec...
Improved Speech Summarization with Multiple-Hypothesis Representations and Kullback-Leibler Divergence Measures
Imperfect speech recognition often leads to degraded performance when leveraging existing text-based methods for speech summarization. To alleviate this problem, this paper investigates various ways to robustly represent the recognition hypotheses of spoken documents beyond the top scoring ones. Moreover, a new summarization method stemming from the Kullback-Leibler (KL) divergence measure and ...
Journal title:
- CoRR
Volume abs/1209.3126
Publication date: 2012